Normalization and Pre-tokenization Simplified

Before we go deep into the three most common subword tokenization algorithms (Byte-Pair Encoding [BPE], WordPiece, and Unigram) used with Transformer models, let's first understand the preprocessing steps that each tokenizer applies to text. These steps include normalization and pre-tokenization.

>> Normalization

   Normalization is like a cleanup step for the text. It involves actions like:
    - Removing unnecessary spaces
    - Converting all letters to lowercase
    - Removing accents from characters (like turning "é" into "e")

   If you're familiar with Unicode normalization (like NFC or NFKC), this is something the tokenizer might also do.

   For example, let's use the 🤗 Transformers library to see this in action:

python code

'''
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tokenizer.backend_tokenizer))
'''

This code loads a BERT tokenizer and gives you access to its internal workings. We can then see how it normalizes text:

python code

'''
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
'''

This would output:

'''
'hello how are u?'
'''

   Here, the normalization step turned everything into lowercase and removed the accents.

   Try it out: Load a different version of the BERT tokenizer (like `bert-base-cased`) and see how it handles the same text. What differences do you notice?

>> Pre-tokenization

   After normalization, the next step is pre-tokenization. This step involves breaking down the text into smaller parts, usually words, that the tokenizer can then further split into subwords.

   For instance, a simple word-based tokenizer might split the text "Hello, how are you?" into words based on spaces and punctuation:

python code

'''
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are  you?")
'''

   This might give you something like:

'''
[('Hello', (0, 5)), (',', (5, 6)), ('how', (7, 10)), ('are', (11, 14)), ('you', (16, 19)), ('?', (19, 20))]
'''

   Here, each word and punctuation mark is identified along with its position in the original text.

   Different tokenizers handle pre-tokenization differently. For example:
    - The GPT-2 tokenizer keeps spaces and replaces them with a special character `Ġ` to preserve the original spacing.
    - The T5 tokenizer replaces spaces with an underscore (`_`) and splits only on spaces, not punctuation.

   Now that you have a basic understanding of normalization and pre-tokenization, you're ready to explore the tokenization algorithms themselves, starting with SentencePiece, followed by BPE, WordPiece, and Unigram.

>> SentencePiece Overview

   SentencePiece is a tokenization algorithm that works with any of the models we'll explore. It treats text as a sequence of Unicode characters and uses a special character (`▁`) to replace spaces. This approach is useful for languages that don't use spaces, like Chinese or Japanese. One key feature of SentencePiece is that it allows reversible tokenization, meaning you can easily decode the tokens back into the original text.